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Abstract 

Although computationally aligning sequence is a crucial step in the vast majority of comparative ge- 
nomics studies our understanding of alignment biases still needs to be improved. To infer true structural 
or homologous regions computational alignments need further evaluation. It has been shown that the 
accuracy of aligned positions can drop substantially in particular around gaps. Here we focus on re- 
evaluation of score-based alignments with affine gap penalty costs. We exploit their relationships with 
pair hidden Markov models and develop efficient algorithms by which to identify gaps which are signif- 
icant in terms of length and multiplicity. We evaluate our statistics with respect to the well-established 
structural alignments from SAB mark and find that indel reliability substantially increases with their sig- 
nificance in particular in worst-case twilight zone alignments. This points out that our statistics can 
reliably complement other methods which mostly focus on the reliability of match positions. 

Introduction 

Having been introduced over three decades ago ll22l the sequence-alignment problem has remained one of 
the most actively studied topics in computational biology. While the vast majority of comparative genomics 
studies crucially depend on alignment quality inaccuracies abundantly occur. This can have detrimental ef- 
fects in all kinds of downstream analyses |18|. Still, our understanding of the involved biases remains rather 
rudimentary Ifl6l . lfl9l . That different methods often yield contradictory statements [9] further establishes 
the need for further investigations into the essence of alignment biases and their consequences ifTol . 

While the sequence-alignment problem virtually is that of inferring the correct placement of gaps, inser- 
tions and deletions (indels) have remained the most unreliable parts of the alignments. For example, Lunter 
et al. lfl9l . in a whole-genome alignment study, observe 96% alignment accuracy for alignment positions 
which are far away from gaps while accuracy drops down to 56% when considering positions closely sur- 
rounding gaps. They also observe a downward bias in the number of inferred indels which is due to effects 
termed gap attraction and gap annihilation. Decreased numbers of inferred indels were equally observed in 
other recent studies lfT7ll25l . This points out that numbers and size of computationally inferred indels can 
make statements about alignment quality. 

The purpose of this paper is to systematically address such questions. We develop a statistical framework 
by which to efficiently compute probabilities of the type 

^{h,A{x,y) > k | L A (x,y) = n,Sim A (x,y) G [01,02]) (1) 
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where (x, y) has been randomly sampled from an appropriate pool of protein pairs. In the following pools 
contain protein pairs which have a (either false or true positive) structural SABMark [28 ] (see below) align- 
ment. In case of, for example, all pairs of human proteins, £T|) would act as null distribution for human. A is 
a local or global optimal, score-based alignment procedure with affine gap penalties such as the affine gap 
cost version of the Needleman-Wunsch (NW) algorithm j22l [T4l or the Smith- Waterman (SW) algorithm 
Il30ll32l . L_a(x, y) is the length of the alignment, Smi^x, y) denotes alignment similarity that is the frac- 
tion of perfectly matching and "well-behaved" mismatches vs. "bad" mismatches (as measured in terms of 
biochemical affinity l23l ) and gap positions. IcI,a( x > y) finally denotes the length of the d-th longest gap in 
the alignment. In summary, (Q]) can be read as the probability that a NW resp. SW alignment of length n and 
similarity between o\ and 02 contains at least d gaps of length k and the reasoning is that gaps which make 
part of significant such gap combinations are more likely to reflect true indels. Significance is determined 
conditioned the length L(x,y) of the alignment as well as alignment similarity Sim(x,y). The reasoning 
behind this is that longer alignments are more likely to accumulate spurious indels such that only increased 
gap length and multiplicity are significant signs of true indels. Increased similarity Sim(x,y), however, 
indicates that already shorter and less gaps are more likely to reflect true indels simply because an alignment 
of high similarity is an overall more trustworthy statement. In summary, we provide a statistically sound, 
systematic approach to answering questions such as "Am I to believe that 4 gaps of size at least 6 in an 
alignment of length 200 and similarity 50 are likely to reflect true indels" as motivated by the recent studies 

im nulla. 

We opted to address these questions for score-based alignments with affine gap costs for two reasons: 

1. To employ score-based such alignments still is a most popular option among most bioinformatics 
practitioners. 

2. Such alignments can be alternatively viewed as Viterbi paths in pair HMMs. While exact statistics on 
Viterbi paths are hard to obtain and beyond the scope of this study we obtain reasonable approxima- 
tions by "Viterbi training" sensibly modified versions of the hidden Markov chains which underlie the 
pair HMMs. 

We evaluate our statistics on the well-established SAB mark [28] alignments. SAB mark is a database of 
structurally related proteins which cover the entire known fold space. The "Twilight Zone set" was partic- 
ularly designed to represent the worst case scenario for sequence alignment. While we obtain good results 
also in the more benign "Superfamilies set" of alignments it is that worst case scenario of twilight zone 
alignments where our statistics prove their particular usefulness. Here significance of gap multiplicity is 
crucial while significance of indel length alone does not necessarily indicate enhanced indel quality. 

Related Work 

ll20l re-evaluate match (but not indel) positions in global score-based alignments by obtaining reliability 
scores from suboptimal alignments. Similarly, l29l derive reliability scores also for indel positions in global 
score-based alignments. However, the method presented in |29ll reportedly only works in the case of more 
than 30% sequence identity. Related work where structural profile information is used is lOTI whereas Q 
re-align rather than re-evaluate. 

Posterior decoding algorithms (see e.g. lfl"0l[T9l l3l for most recent approaches) are related to re-evaluation 
of alignments insofar as posterior probabilities can be interpreted as reliability scores. However, how to score 
indels as a whole by way of posterior decoding does not have a straightforward answer. We are aware of the 
potential advantages inherent to posterior decoding algorithms — it is work in progress of ours to combine 
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the ideas of pair HMM based posterior decoding aligners with the ideas from this stud)Lj. 

To assess statistical significance of alignment phenomena is certainly related to the vastly used Altschul- 
Dembo-Karlin statistics lfl5l [8] [Q where score significance serves as an indicator of protein homology. 

To devise computational indel models still remains an area of active research (e.g. l26l |U 21 Q21 EH). 
However, the community has not yet come to a final conclusion. 

Last but not least, the algorithms presented here are related to the algorithms developed in ll27l where 
the special case of d = 1 for only global alignments in © was treated to explore the relationship of indel 
length and functional divergence. The advances achieved here are to provide null models also for the more 
complex case of local alignments and to devise a dynamic programming approach also for the case d > 1 
which required to develop generalized inclusion-exclusion arguments. 

Just like in E71 note that empirical statistics approaches fail for the same reasons that have justified the 
development of the Altschul-Dembo-Karlin statistics: sizes of samples are usually much too small. Here 
samples (indels in alignments) are subdivided into bins of equal alignment similarity and then further into 
bins of equal length n and ci-th longest indel size k. 

Summary of Contributions 

As above-mentioned, our work is centered around computation of probabilities 

P(IdO, y)>k\ L(x, y) = n, Sim(x, y) G [a u cr 2 ]). (2) 

We refer to this problem as Multiple Indel Length Problem (MILP) in the following. Our contributions then 
are as follows: 

1. We are the first ones to address this problem and derive appropriate Markov chain based null models 
from the pair HMMs which underlie the NW resp. SW algorithms to yield approximations for the probabil- 
ities ©. 

2. Despite having a natural formulation, the inherent Markov chain problem had no known efficient 
solution. We present the first efficient algorithm to solve it. 

3. We demonstrate the usefulness of such statistics by showing that significant gaps in both global and 
local alignments indicate increased reliability in terms of identifying true structural indel positions. This 
became particularly obvious for worst-case twilight zone alignments of at most 25% sequence identity. 

4. Thereby we deliver statistical evidence of that computational alignments are biased in terms of num- 
bers and sizes of gaps as described in |[T9ll25l . In particular too little numbers of gaps can reflect alignment 
artifacts. 

5. Re-evaluation of indels in score-based both local and global alignments had not been explicitly 
addressed before, in particular, reliable solutions for worst-case twilight zone alignments were missing. Our 
work adds to (rather than competes with) the above-mentioned related work. 

In summary, we have complemented extant methods for score-based alignment re-evaluation. Note 
that none of the existing methods explicitly addresses indel reliability but rather focus on the reliability of 
substitutions. 

Methods 

Pair HMMs and Viterbi Path Statistics 

In the following we only treat the more complex case of local Smith- Waterman alignments. See 11271 for the 
case of global Needleman-Wunsch like alignments and Fig. 12. II for a picture of the corresponding Markov 

1 Note that although we derive statistical scores for indels as a whole our evaluation in the Results section will refer to counting 
individual indel positions. 
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(a) Standard pair HMM 

q 5 




(b) Markov Chain 

Figure 1: Standard pair HMM corresponding to local Smith- Waterman alignments and the Markov Chain 
whose generative statistics, after Viterbi training, approximate the Viterbi statistics of the pair HMM for 
local alignments. 




chain. For a treatment of Needleman-Wunsch aligments the Markov chain in Fig. 12.11 has to be, mutatis 
mutandis, plugged into the computations of the subsequent subsections. 

A local Smith- Waterman alignment with affine gap penalties of two sequences x = x\...x w ,y = y\...y z 
is associated with the most likely sequence of hidden states (i.e. the Viterbi path) in the pair HMM of 
Fig. l(a)| ifTTl . The path of hidden states translates to an alignment of the two sequences by emitting the 



necessary symbols along the run. Statistics on Viterbi paths in HMMs pose hard mathematical problems 
and have not been fully understood. In analogy to [27], we construct a Markov chain whose common, 
generative statistics mimick the Viterbi statistics of interest here. Hence probabilities derived from this 
Markov chain serve as approximations of ©. We do this by the following steps: 

1. We take the Markov chain of the pair HMM in Fig. |l(a)| as a template. 

2. We add two match states Ml, M3. The original match state is M2. 

3. We merge the initial resp. terminal regions into one start resp. end state. 

4. We collapse states X and Y into one indel state I. 

The Markov chain approach is justified by the fact that consecutive runs in Viterbi paths are approxi- 
mately governed by the geometric distribution which is precisely what a Markov chain reflects. To see this 
note that to stay with a state in a Viterbi path is, approximately, associated with that a self-transition attains 
maximum probability in the next step. This depends both on the original transition probability and the back- 
ward probability which depends on the observed subsequence to follow starting from that state (see |[T2l . 
(4.30) and the related discussion). Since it is a general, computational assumption that the background dis- 
tribution on observed symbols (amino acids) is position-independent Viterbi path transitions can be assumed 
to be (approximately) position-independent, too. Note that assuming sequence to be position-independent is 
reflected by that scoring schemes are position-independent. Clearly, this is a computational assumption — we 
are aware of that the biological reality can be different. 

The second point is to take into account the non-stationary character of the original Markov chain. Note 
that in local alignments, initial and final consecutive stretches of (mis)matches are longer than intermediate 



(mis)match stretches which translates to q\,q% > q§ in Fig. |l(a)| To see this in more detail, note first that the 
related discussion in |[T2l is on stationary HMMs. The non-stationarity of the pair HMM under consideration 
here is due to that the initial and terminal regions are heavily position-dependent. As a result, the Viterbi 
paths under consideration have a memory which can contradict the Markov assumption. The most striking 
effect is that 

F(X t+1 = I\X t = M, X t _! £ {M, I}) > F(X t+1 = I\X t = M, X t ^ = RY1). (3) 

which reflects that to open up a gap shortly after having initiated the core alignment tends to be avoided 
in order to circumvent an early gap penalty. In symbols, this means that it is more likely to postpone the 
core alignment and see (RY1)(RY1)(RY1) (and, possibly, some more (RX1) before that) than running into 
an early gap after alignment initiation (RYl)MI. Similar considerations hold for the terminal regions. The 
second point addresses this by adding initial and terminal match regions Ml and M3 which take the non- 
stationary character of these areas into account. Point 3 merely reflects that we are only interested in statistics 
on alignment regions. Point 4 finally accounts for that we do not make a difference between insertions and 
deletions due to the involved symmetry (relative to exchanging sequences). 

Algorithmic Solution of the MILP 

We define C n k ^ to be the set of sequences over the alphabet B, Ml, I, M2, M3, E (for Begin, Matchl, Indel, 
Match2, Match3 and End) of length n that contain at least d consecutive I stretches of length at least k. Let 
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A n := {X n = M3, X n+ \ = E} be the set of sequences with an alignment region of length n. We then 
suggest the following procedure to compute approximations of the probabilities © where T(cri, 02) is sup- 
posed to be a pool of protein pairs (x, y) whose alignments exhibit alignment similarity Sim(x, y) G \cr\ , 02] ■ 

1: Compute alignments for all sequence pairs in T{a\, o^)- 

2: Infer parameters q\,q2, Q3,Qi, Q5, Q6 of the Markov chain by Viterbi training it with the alignments. 
3: n <— length of the alignment of x and y 

4: Compute F(C n> k,d H A n ) as well as ¥(A n ), the probabilities that the Markov chain of Fig. |l(b)| 
generates sequences from C n ^,d H A n and A n 
5: Output 

r(pn,M I A n ) - — -r (4) 

as an approximation for (O. 

The idea of step 1 and 2 is to specifically train the Markov chain to generate alignments from the pool 
T(o"2) &2)- m our setting, Viterbi training translates to counting M\-\o-M\, M\-to-I, I-to-I, I-to-M2, M2- 
to-M2 and M^-to-M^ transitions in the alignments under consideration to provide maximum likelihood 
estimates for gi, 52, 93, <?4, Q5 and q^. 



Efficient Computation of ¥{C n ^,d H A n ) 

The problem of computing probabilities of the type (f2l) has been made the problem of computing the prob- 
ability that the Markov chain generates sequences from C n ^d ^ A n and A n . While computing 

F(An) = F(X n = M 3 ) • nXn+i = E I X n = M 3 ) (5) 

is an elementary computation, the question of efficient computation and/or closed formulas for probabilities 
of the type P(C n ^ d H A n ) had not been addressed in the mathematical literature and poses a last, involved 
problem. 

The approach taken here is related to the one taken in l27l . which treated the special case of single 
consecutive runs (i.e. d = 1) in the context of the two-state Markov chains which reflect null models for 
global alignments. We generalize this in two aspects. First, we provide a solution for more than two states 
(our approach applies for arbitrary numbers of states). Second, we show how to deal with multiple runs. 

The probability event design trick inherent to our solution was adopted from that of [24]. The solution 
provided in ll24l can be used for the (rather irrelevant) case of global alignments with linear gap penalties, 
i.e. gap opening and extension are identically scored. See also lfT3l for related mathematical treatments 
of the i.i.d. case. 

In the following, let i, j £ {B, Mi , I, M2, M3, E} be indices ranging over the alphabet of Markov chain 
states. Let ej G M 6 be the standard basis vector of M 6 having a 1 in the i-th component and zero elsewhere. 
For example, e\ = (0, 0, 1, 0, 0, 0), eM 3 = (0,0,0,0,1,0). We furthermore denote the standard scalar 
product on M 6 by (.,.}. 

Efficient computation of the probabilities P(C n f.,d H A n ) is obtained by a dynamic programming ap- 
proach. As usual, we collect the Markov chain parameters (in accordance with Fig. |l(b)| ) into a state transi- 
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tion probability matrix 



P = {Pi. 
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(6) 



and an initial probability distribution vector n = 6b = (1, 0, 0, 0, 0, 0) T . The initial distribution reflects that 
we start an alignment from the 'Begin' state. More formally, F(Xq = B) = 1. For example, according to 
the laws that govern a Markov chain, the probability of being in the indel state I at position t in a sequence 
generated by the Markov chain is 

F(X t = I) = ( e/ ,P*vr} = (e^P^B). (7) 

It can be seen that naive approaches to computing F(C n: k,d H A n ) result in runtimes that are exponential in 
n, the length of the alignments, which is infeasible. Efficient computation of these probabilities is helped by 
adopting the event design trick of ll24l . In detail, we define 



D 



t.k 



{Xt — I, ...,Xt + k-i — I,Xt+k 1} 



(8) 



to be the set of sequences that have a run of state I of length k that stretches from positions t to t + k — 1 
and ends at position t + k — 1, that is, the run is followed by a visit of state different from I at position t + k. 
We further define 



1 



7TI 



(PBl,PM 1 l,0,PM 2 l,PM 3 l,PEl) 



(9) 



(1 -PII 

which can be interpreted as the state the Markov chain is in if we know that the Markov chain has left state I 
at the time step before. Consider F(X t + s = 1 1 Xt—\ =l,X t ^ I) as the probability that the Markov chain is 
in state I at period t + s after having been in the state tt\ at period t (note that this probability is independent 
of t as we deal with a homogeneous Markov chain). Similarly ¥(A t+ j t+s \ D t f.) is the probability that the 
Markov chain transits from state M 3 to state E at position t + k + s + 1 while it has a run of state I of length 
k that stretches from positions t to t + k — 1 and ends at position t + k — 1. Lastly, we introduce the variables 



Qi, 



Rl, 



nx si = i) n F ( x ^ = 1 1 = J i x t + !) 

l<si ,...,sm<i i=2 
s l + ...+s m =l 

Ql,mH A t+k+L-i | A,fe), 1 < m < L < n 



(10) 



(11) 



l=m 



for 1 < m < I < n where the sum reflects summing over partitions of the integer I into m positive, not 
necessarily different, integers Sj. We then obtain the following lemma a proof of which needs a generalized 
inclusion-exlusion argument. 



Lemma 2.1. 



I -2- I 

E 

m=l 



(-1) 7 



m 
d- 



{p^il-pilT-Rn- 



mk,rri' 



(12) 
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Proof: We start by transforming 

r B (s) :=P(X S = 1) =¥(X S =I\X = B) = (ei,P s 7r), s > 1 

and 

-r^.-MY — X \ Y _ t v -J. t\ F ( X t-i =I,X t ^ I, X t+S = I) 

n( s ) ._ F(x t+S - 1 1 x t ^ - 1, x t ^ i) _ 

E»#i p (^-i = Mfr = h Xt+s = I) _ Pjgt-i = I) Ejgg = h Xt+s = 1 1 Xt-i = I) 

E*iP(*t-i = I,*t = i) ~ F(X t ^ 1 = l)Z l ^nXt = i\X t ^ 1 = I) 
F(Xt-i = I) E^i nX t = i I Xt-i = l)nXt+s = l\X t = i) 



F(X t .! = I) E^i HXt = i I X t -i = I) 
According to elementary Markov chain theory, one obtains, where here and in the following a(k) 

piTHi-ph) 

p(A, fc ) = nx t = i, Xt+fc-i = i, ^ t+fc ^ i) 
fc-i 

= ¥(X t = I) • [J ¥ (X t+l = 1 1 X t+i _i = I) • ¥(X t+k + 1 1 X t+k ^ = I) 

i=l 

= r B (t)-p I V 1 -(i-pn) 

and similarly, for t 2 > t\ 

p (A 2 ,fc I Ai,fc) = p (A2,fc I ^i+fe-i = I) x tl+k ^ I) 

= fri(t 2 - *i - fc)Pn~ H 1 " Mi) t 2 -k>h 
[0 t 2 -k<t 1 

Plugging (PT5T ) and ([TBI together yields, fori < ii < ... < t m < n — k + 1, 



P(%n...nD w ) 



'P(A llfc ) • nS 1 F (A l+1 1 AJ Vi : ti+i - U > k 



else 
, v (17) 

= f rB(ti) ■ Vu=x n(U+l ~ U - k) ■ (p*- 1 ^ - p„)) m Vi : ti +1 -U>k 

" [0 else 
Including this into the definition of the Qi m and B,L,m yields 

m 

Ql,m= ^2 r B(«l) J|n(Si) ; l<m<l< 



■n 



i<si,...,s m <l i=2 

s 1 + ...+s m = i 



and 



RL,m = ^ Ql,mrM 3 (sL- 



l=m 

We now observe that 

C nj fc 5 d = Ui<t 1< t 2 <...<t d <n-fc+i(Ai,fc n ... n D tdjk ). 



and we recall that we would like to compute 



P(C„, M n An) 



(21) 



where A n := {X n = M3, X n+ \ = E} is the set of sequences that have an alignment region of length n. 
Proceeding by inclusion-exclusion yields 

P(C„ iM n An) ® P(Ui<i 1<t2< ... < i d < n _ fc+1 (A 1 ,fc n ... n D Utk n A n )) 
= Y K m4 -¥(D tuk n...r)D tm ^nA n ) 

m=d 

= Y K m,d-W(D tuk n...nD tm}k )-F{A n \D tm>k ) 

m=d 

where LfcTTfJ reflects the number of non-overlapping events D ti , representing subsequences of length k + 1, 
that fit into a sequence of length n and 



(22) 



K, 



m,d 



-1 



>m+d 



m — 1 
d-l 



(23) 



is a generalized inclusion-exclusion coefficient. While the result can be obtained from considerations that 
are analogous to that of the usual case d = 1 (note that K m ^ = (— l) m+1 just results in the usual inclusion- 
exclusion), it is not common in the mathematical literature. See the subsequent lemma 12.21 for a formal 
statement and a proof. 

We further define, by computations that are similar to ([T41 . 



r M3 (s) := W(A t+k+s I A,fc) 

= F(X t+h+s = M 3 I D t>k ) ■ ¥(X t+k+s+1 = E I X t+k+s = M 3 ) = (e M3 , P s ^i) • Pm 3 e- (24) 



By computations that are analogous to those of ll27l . where in the following a(k) := p^ 1 (1 — pn 



{CnAd n A n ) W Y K m4 ¥(D tuk n ... n D tm , k ) ■ F(A n I A m ,fc) 



CE3 



I 1 



m— 1 



XI ^m,d • a(A;) m ^ n(ti) ■ ri(tj+i - U - k) ■ r Ma (n — t m — k) 



m=l 



l<t 1 <...<t m <n-k+l 
t i+l~ t i> k 



i=l 



Lfc+iJ Xn—mk 

y~] K m>d ■ a(k) 

m=l 

_2_ 

Lfc+lJ 

Y K m4 ■ a{k) m ■ R 

m=l 

I -2- I 

Lfc+lJ 



Y Q l > m ' r M 3 ( n -mk- I)) 



l=m 



n—mk'm 



m=l 



m — 1 



*) •(p{r 1 (i-wi)r-fl 



n—mk,m- 



(25) 
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Lemma 2.2. Let Di,i G {1, N} be a family of N events. Then it holds that 

N / _ 1\ - 

P(Ui< il <...< id < iV (A 1 n ... n AJ = E ^ m+d ( d _ J E n ... n A 



Proof. The proof proceeds similarly to that of the special, well known case of d = 1. Let ui € 
U^-cxidl^u ^ •■• l~l sucn w.l.o.g., u; is contained in L>i, ...,-D n where n < N, but not con- 
tained in D n+ i, Dn. Let l w be the indicator function of w. According to the choice of u it holds that 

, , f 1 1 < i < n 
UD ' )= { else " ' <27) 

Proceeding along the lines of the proof of the usual inclusion-exclusion theorem (d = 1) (see e.g. H) it 
suffices to show that 



i.(u 11 <...< 1 jA 1 n...nAJ) = t(-ir("_ 1 1 ) £ Un^) 

m=c( ^ ' JC{l,..,n) 



(28) 



Evaluating this equation at oj amounts to showing that 



This is done by induction on d. The case d = 1 



z=i v / \ / m= i \ / 

is the usual case of standard inclusion-exclusion which, by putting the right hand side to the left, follows 
from 

^> ir Q =(i - i)n=o - (3i) 

m=0 ^ ' 

d — > d + 1: In this case, In this case, we have to show that 

E<-ir"*(V)(;)-i 

m=d+l v / \ / 

Therefore, we can assume that 



/+d+if m - 1 \ ( n \ = ( n 

d-l)\m) \d 



(33) 



m=d m=d+l 

Furthermore, it holds that 

fm\fn\ n\ m\ n\ (n — d)\ ( n\ ( (n — d) 



d)\m) m!(n — m)\ d\(m — d)\ d\{n — d)\ (m — d)\{n — m)\ \dj\{m — d) 



(34) 
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We proceed 




which concludes the proof. o 

The consequences can be summarized in the following theorem. 
Theorem 2.1. A full table of values F(C n ^4 CI A n ), k < n < N can be computed in 0(iV 3 ) runtime. 
Proof. Observing the recursive relationship 

l-m+l 

Qi,m= Yl nXt+ s = l\X t _ 1 = I,X t ^I)Q l _ s , m _ 1 , m>l (36) 

8=1 

yields a standard dynamic programming procedure by which the ensemble of the Qi jm and the Rh,m 
(1 < m < l,L < N) can be computed in 0(N 3 ) runtime. This also requires that the values F(X S = 
T),F(Xt+ s = 1 1 Xt-\ = I, X t 7^ I) have been precomputed which can be done in time linear in N. After 
computation of the Qi )m and the Rl,™, computation of the P(C nj fc,d H A n ), 1 < k < n < N then equally 
requires 0(N 3 ) time which follows from lemma I27TI o 

Results 

Data We downloaded both the "Superfamilies" (Sup) and "Twilight Zone" (Twi) datasets together with 
their structural alignment information from SABmark 1.65 l28l . including the suggested false positive pairs 
(that is structurally unrelated, but apparently similar sequences, see ll28l for a detailed description). While 
Sup is a more benign set of structural alignments where protein pairs can be assumed to be homologous and 
which contains alignments of up to 50% identity, Twi is a worst case scenario of alignments between only 
0-25 % sequence identity where the presence of a common evolutionary ancestor remains unclear. 

To calculate pairwise global resp. local alignments we used the "GGSEARCH" resp. "LALIGN" tool 
from the FASTA sequence comparison package ll23l . As a substitution matrix, BLOSUM50 (default) 
was used. GGSEARCH resp. LALIGN implement the classical Needleman-Wunsch (NW) resp. Smith- 
Waterman (SW) alignment algorithm both with affine gap penalties. We subsequently discarded global 
resp. local alignments of an e-value larger than 10.0 resp. 1.0, as suggested as a default threshold setting 
[23 ], in order to ensure to only treat alignments which can be assumed not to be entirely random. 
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Table 1 : Markov chain parameters for local alignments. 



Twilight Zone (Twi) 



Similarity (%) 


20-30 


30-40 


40-50 


50-60 


60-70 


70-80 


80-90 


90 - 100 


No. Alignments 




27 


1896 


12512 


9716 


3956 


1733 


259 


3i 




0.9552 


0.9606 


0.9564 


0.9485 


0.9364 


0.9188 


0.9149 


32 




0.0216 


0.0300 


0.0315 


0.0265 


0.0167 


0.0078 


0.0036 


33 




0.7500 


0.6667 


0.5893 


0.4692 


0.3210 


0.3640 


0.0000 


34 




0.0588 


0.1948 


0.2185 


0.1739 


0.0979 


0.0459 


0.0000 


35 




0.8261 


0.9439 


0.9353 


0.9226 


0.8999 


0.9253 


1.0000 


36 




0.9417 


0.9514 


0.9472 


0.9335 


0.9077 


0.8991 


0.7755 


Superfamilies (Sup) 


Similarity (%) 


20-30 


30-40 


40-50 


50-60 


60-70 


70-80 


80-90 


90 - 100 


No. Alignments 




44 


3743 


23726 


18633 


7275 


2613 


454 


3i 




0.9511 


0.9584 


0.9568 


0.9534 


0.9528 


0.9346 


0.9273 


32 




0.0267 


0.0330 


0.0330 


0.0266 


0.0160 


0.0085 


0.0034 


33 




0.7643 


0.6829 


0.6043 


0.5001 


0.4044 


0.2553 


0.0000 


34 




0.0828 


0.1962 


0.2390 


0.2390 


0.2220 


0.0959 


0.0000 


35 




0.8952 


0.9430 


0.9410 


0.9466 


0.9674 


0.9820 


0.0000 


36 




0.9438 


0.9508 


0.9495 


0.9443 


0.9504 


0.9407 


0.7921 



Table 2: Markov chain parameters for global alignments. 



Twilight Zone (Twi) 



Similarity (%) 


20-30 


30-40 


40-50 


50-60 


60-70 


70-80 


80-90 


90 


- 100 


No. Alignments 


31 


1811 


9156 


616 


36 


7 


8 






1 -2p 


0.9092 


0.9290 


0.9287 


0.9311 


0.9528 


0.9790 


0.9939 






3 


0.2615 


0.1835 


0.1475 


0.0994 


0.0619 


0.0269 


0.0364 






Superfamilies (Sup) 


Similarity (%) 


20-30 


30-40 


40-50 


50-60 


60-70 


70-80 


80-90 


90 


- 100 


No. Alignments 


44 


2925 


17127 


3234 


1304 


454 


39 






1 -2p 


0.9054 


0.9277 


0.9292 


0.9421 


0.9630 


0.9788 


0.9900 






3 


0.2528 


0.1876 


0.1482 


0.0980 


0.0523 


0.0257 


0.0097 







We then subdivided the resulting 4 groups (NW Twi, NW Sup, SW Twi and SW Sup) of computational 
alignments into pools of alignments of similarity in [a, a+ 10] where a ranged from 20 to 90. We then trained 
parameters (using also the false positive SABmark alignments in order to obtain unbiased null models) 
for the 36 = 4 x 9 different Markov chains (2-state as in ll27l resp. 6-state as described here for global 
resp. local) and computed probability tables as described in the Methods section. After computation of 
probability tables, false positive alignments were discarded. See below for Markov chain parameters. 

The remaining (non false-positive) NW Twi, NW Sup, SW Twi and SW Sup alignments contained 
179018, 407629, 20853 and 86233 gap positions contained in 122701, 276082, 17776 and 68513 gaps. In 
the global alignments this includes also initial and end gaps. 

Evaluation Strategies 

Based on efficient computation of probabilities of the type (fj]) we devise strategies SigD(#) for predicting 
indel reliability in NW and SW alignments where D = 1, 4, 7. Let K be the length of the L-th longest indel 
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Smith-Waterman Twi 



Smith-Waterman Sup 




_i i i i i i i i i_ 

1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 

Recall 

Needleman-Wunsch Twi 



0.73 
0.72 
0.71 
0.7 
0.69 
0.68 
0.67 
0.66 
0.65 
0.64 
0.63 
0.62 



— i 1 1 1 1 1 1 1 r 

. "Sig-7" 

"Sig-4" 

"Sig-1" 

"Const" 




O 
CO 

'o 

CD 



O 
CO 

'o 

CD 




_l I I I I I I I l_ 

1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 

Recall 

Needleman-Wunsch Sup 



1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 

Recall 



0.78 
0.76 
0.74 
0.72 
0.7 
0.68 
0.66 
0.64 
0.62 



— i 1 1 1 1 1 1 1 r 

"Sig-7" 

"Sig-4" 

"Sig-1" 
"Const" 




i i i i i i i i i 

1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 

Recall 



Figure 3: Precision-Recall curves for the different sets of computational alignments. Recall is lowered 
through lowering the significance threshold 9 for the strategies SigD(#) (9 = 1.0 for maximal recall of 1.0) 
and for raising indel length in the baseline strategy Const (length = 1 for maximal recall of 1.0). 



in the NW resp. SW alignment of proteins x, y. In strategy SigD(#), tn i s indel is classified as reliable if 

Sig D (0) : Wxmn{D,L){x,y) > K \ L(x, y), Sim(x, y)) < 9. (37) 

In other words, we look up whether it is significant that an alignment of length L(x, y) and similarity 
Sim(x, y) contains at least L resp. D, in case of D > L resp. D < L, indels of size K. Note that, since 
L > 1 hence mm(D, L) = 1, in strategy Sigi (t9) an indel of length K is evaluated as reliable if and only 
if the indel is significantly long without considering its relationship with the other gaps in the alignment. 
This is different for strategy Sig7(#) where, for example, the 6-th longest indel is evaluated as reliable if 
it is significant to have at least 6 indels of that length (min(D, L) = 6) whereas the 8-th longest indel is 
supposed to be reliable if there are at least 7 indels of that length (min(7J, L) = 7). Note that in strategy 
Sig7(t9) already shorter indels are classified as reliable in case that there are many indels of that length in 
the alignment which is not the case in strategy Sigi(#). Clearly, raising D beyond 7 might make sense. For 
sake of simplicity only, we restricted our attention to D = 1, 4, 7. 

As a simple baseline method we suggest Const which considers an indel as reliable if its length exceeds 
a constant threshold. Both raising the constant length threshold in Const and lowering 9 in SigD(#) lead to 
reduced amounts of indels classified as reliable. 

Evaluation Measures We found that for both global and local alignments further evaluation of gaps of 
length at most 4 and length greater than 30 (global) resp. 20 (local) did not make much sense (see table 13.11 
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NWTwi 


NWSup 


SWTwi 


SWSup 




< 4 


> 30 


< 4 


> 30 


< 4 


> 20 


< 4 


> 20 


FGP 


0.42 


0.03 


0.41 


0.01 


0.63 


0.01 


0.53 


0.02 


PPV 


0.58 


0.64 


0.53 


0.92 


0.53 


0.69 


0.45 


0.87 



Table 3: Fractions of Gap Positions (FGP) contained in gaps of different length ranges and Fractions of True 
Gap Positions (PPV) contained in such gaps 



Recall 




-lo g 




IL 




-log 9 




IL 




-log 9 




IL 




-log 9 




IL 


1.0 


0.0 


0.0 


0.0 


5 


0.0 


0.0 


0.0 


5 


0.0 


0.0 


0.0 


5 


0.0 


0.0 


0.0 


5 


0.75 


2.0 


2.0 


1.0 


5 


2.0 


2.0 


1.0 


6 


19.0 


18.5 


6.5 


6 


21.0 


19.5 


7.0 


6 


0.5 


2.5 


2.5 


1.5 


6 


3.0 


3.0 


1.5 


7 


28.0 


24.0 


10.5 


8 


30.0 


27.0 


11.5 


8 


0.25 


3.5 


3.5 


2.5 


8 


4.5 


4.5 


3.0 


10 


38.5 


33.0 


16.0 


11 


41.5 


36.0 


18.0 


11 




Sig7 


Sig4 


Sigi 


C 


Sig7 


Sig4 


Sigi 


C 


Sig 7 


Sig4 


Sigi 


C 


Sig 7 


Sig4 


Sigi 


C 






SW Twi 






SW Sup 






NWTwi 






NW Sup 





Table 4: Relationship between Recall and 9 (displayed as — log 9) for strategies SigD and indel length (= 
IL) for strategy Const (= C). 



for statistics). However, for gaps of length ranging from 5 to 20 resp. 30 in local resp. global alignments a 
significance analysis made sense. 

We evaluated the indel positions in gaps of length 5 — 20 resp. 5 — 30 in local resp. global alignments 
by defining a true positive (TP) to be a computational gap position which is classified as reliable (meaning 
that it is found to be significant by SigD(#), D = 1,4, 7 or long enough by Const) and coincides with a true 
structural indel position in the reference structural alignment as provided by SABmark. Correspondingly, a 
false positive (FP) is a gap position classified as reliable which cannot be found in the reference alignment. 
A true negative (TN) is a gap position not classified as reliable and not a structural indel position and a false 
negative (FN) is not classified as reliable but refers to a true structural indel position. Recall, as usual, is 
calculated as TP/ (TP + FN) whereas Precision (also called PPV=Positive Predictive Value) is calculated 
as TP I '{TP + FP). 

Discussion of Results 

Results are displayed in Figure [3] where we have plotted Precision vs. Recall while lowering 9 for the 
strategies SigD(#) and increasing indel length for the baseline method Const. While Recall = 1.0 relates to 
9 = 1.0 in the strategies SigD maximal recall relates to indel length 5 in the strategy Const. Table @]displays 
further supporting statistics on the relationship between choices of 9 resp. indel length and Recall. 

A first look reveals that indel reliability clearly increases for increasing indel length — longer indels are 
more likely to contain true indel positions. However, further improvements can be achieved by classifying 
indels as reliable according to significance. For the Sup alignments improvements over the baseline method 
are only slight. For both local and global alignments strategy Sigi is an option in particular when it comes 
to achieving utmost precision which can be raised up to 0.8. For the Twi alignments differences are obvious. 
More importantly, just considering indel length without evaluating multiplicity does not serve to achieve 
substantially increased Precision. Here, multiplicity is decisive which in particular confirms the findings on 
twilight zone alignments reported in ESI . In the Twi alignments Precision can be raised up to about 0.7. 
Note that [29] achieve 0.7 Precision on both match and gap positions for structural alignments (not from 
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SABmark) of 25 — 30% identity while reporting that their evaluation does not work for alignments of less 
than 25% identity which renders it not applicable for the Twi alignments. The posterior decoding aligner 
FSA which outperformed all other multiple aligners in terms of Precision on both (mis)match and gaps in 
the entire SABmark dataset, comprising both Sup and Twi Q report Precision of 0.52 (all other aligners 
fall below 0.5) without further re-evaluation of their alignments. This lets us conclude that our statistical 
re-evaluation makes an interesting complementary contribution to alignment re-evaluation. 

Conclusion 

Most recent studies have again pointed out that computational alignments of all kinds need further re- 
evaluation in order to avoid detrimental effects in downstream analyses of comparative genomics studies. 
While exact gap placement is at the core of aligning sequence positive prediction rates are worst within 
or closely around inferred indels. Here we have systematically addressed that indel size and multiplicity 
can serve as indicators of alignment artifacts. We have developed a pair HMM based statistical evalua- 
tion pipeline which can soundly distinguish between spurious and reliable indels in alignments with affine 
gap penalties by measuring indel significance in terms of indel size and multiplicity. As a result we are 
able to reliably identify indels which are more likely to enclose true structural indel positions as provided 
by SABmark, raising positive prediction rates up to 0.7 even for worst-case twilight zone alignments of 
maximal 25% sequence identity. Since previous approaches predominantly addressed re-evaluation of 
match/mismatch positions we think that we have made a valuable, complementary contribution to the is- 
sue of alignment re-evaluation. Future work of ours is concerned with re-evaluation of pair HMM based 
posterior decoding aligners which have proven to be superior over score-based aligners in a variety of as- 
pects. 
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